mkdir -p /mnt/storage/$USER/jupyternotebooks/assignment1/functional_analysis_gene_signatures/
cd /mnt/storage/$USER/jupyternotebooks/assignment1/functional_analysis_gene_signatures/
pwd
ln -sf /mnt/storage/$USER/jupyternotebooks/assignment1/RNA-seq/deseq.results.tsv .
head deseq.results.tsv
#fold change, pvalue
head -1 deseq.results.tsv
grep -n Eya1 deseq.results.tsv
grep -n Six2 deseq.results.tsv
#-n, --line-number print line number with output lines
# --line-buffered flush output on every line
# Eya1 and Six2 genes are strongly download-regulated (logFC < -2) and p.adj << 0.05
awk '$3 != "NA" && $3 > 1 && $7 < 0.05 {print ;}' deseq.results.tsv | head
#filter fold change >1 and pvalue <0.05
awk '$3 != "NA" && $3 > 1 && $7 < 0.05 {print $1}' deseq.results.tsv | grep -v baseMean > up-logFC1-p05.txt
awk '$3 != "NA" && $3 < -1 && $7 < 0.05 {print $1}' deseq.results.tsv | grep -v baseMean > down-logFC1-p05.txt
wc -l up-logFC1-p05.txt
wc -l down-logFC1-p05.txt
all 917 up-regulated genes are found
351 down-regulated genes are found
cannot find pathway enricment => use gProfilter
up-logFC1-p05
MF: molecular function
BP: biological process
CC: cellular componen
KEGG: pathway
REAC: reactome pathway
WP: wiki biological pathway
TF: motif discovery, find potential uptream regulators which mofits are rich near these genes
MIRNA: microRNA binding site which are in 3'UTR
The significantly up regulated genes are related to protein binding,system develop,cell periphery,myogenin motif,cortisol synthesis and secretion.
down-logFC1-p05
The significantly reduced expression genes are related to protein binding,anotomical structure development,cellular anatomical entity,BEN motif, and oocyte miosis.
cat deseq.results.tsv | sort -k 3,3gr | awk '$3 != "NA" {print $1}' | grep -v Gene | head
# 3 => third column logFC
# gr => general sort, reverse
cat deseq.results.tsv | sort -k 3,3gr | awk '$3 != "NA" {print $1}' | grep -v Gene | grep -n Eya1
cat deseq.results.tsv | sort -k 3,3gr | awk '$3 != "NA" {print $1}' | grep -v Gene > deseq.results.sortFCdesc.txt
Identify enriched GO terms with ranked gene names. Choose file -> deseq.results.sortFCdesc.txt
http://cbl-gorilla.cs.technion.ac.il/GOrilla/lqupm4zw/GOResults.html
see G protein-coupled receptor signaling pathway
Do the same, but now for a ranked list in ascending order (most down-regulated gene on top):
cat deseq.results.tsv | sort -k3,3g | awk '$3 != "NA" {print $1}' | grep -v Gene > deseq.results.sortFCasc.txt
head deseq.results.sortFCasc.txt
grep -n Eya1 deseq.results.sortFCasc.txt
grep -n Eya1 deseq.results.sortFCdesc.txt
Both the significant up and down-regulated genes are aoosicated with G protein-coupled receptor signaling and nervous system about sensory perception.
A ranking-based enrichment analysis.
cat deseq.results.tsv | sort -k 3,3gr | awk '$3 != "NA" {print $1, $3}' | grep -v Gene | head
cat deseq.results.tsv | sort -k 3,3gr | awk '$3 != "NA" {print $1, $3}' | grep -v Gene | grep -v baseMean | tr ' ' '\t' > deseq.logFC.rnk
grep -n Eya1 deseq.logFC.rnk
#-n => gives the line number
head deseq.logFC.rnk
download deseq.logFC.rnk-> load data-> browse for file-> ok-> run GSEA preRank -> Gene sets database: use "Reactome" (mouse no KEGG)-> run
In the top part (na_pos), the genes related to receptor-type tyrosine-protein phosphatases, neurotransmitter, SMAD(regulates cellular processes) and nuclear receptor transcription pathway are enriched.
Positive
\
most genes are up-regulated
In the negative part, FGFR2(bone growth), FGFR4 (conduct signal), FGFR1 (neurons formation) and Chromosome maintenance associated pathways are significantly enriched.
Negative
Find enriched transcription factor motifs in the upstream (and intronic, downstream) sequences of this gene set.
In Cytoscape, do File->Import->Network->File, and choose the text file with all the up-regulated genes --> run iRegulon.
The highest motif is homer and the second is REST. They have the same seed motif sequence.
A network of predicted REST targets
http://www.pantherdb.org/geneListAnalysis.do \ use pie chart to show all REST target gene in iRegulon belong to which GO term
cellular process 24.2%, 18.0% biological regulation
1269 differentially expressed genes were found (917 upregulated genes and 352 downregulated genes in knock-out group)
Using leading edge is sometimes better than arbitrary threshold, because the leading edge can identify some genes that are not in our selection of the arbitrary threshold (and everyone's threshold are different that cannot compare with each other)
I also use Vocano plot to identify the top 10 Padj of the significant genes\ => knock-out Eya1 genes affect the Podxl, Chga and Pax2 expression which play an important role in kidney cell.
Intriguingly, the Chga is also a REST target gene (iRegulon), maybe Eya1 bind to Chga depend on REST trancription factor. \ Although almost differencial expression gene are not associated with NPCs identity, at least we identify the Chga which is related to REST and Eya1. \ Maybe we can speculate this topic is a innovative filed and this research identified REST-binding motifs for the first time.